Causal Learning via Manifold Regularization

This paper frames causal structure estimation as a machine learning task. The idea is to treat indicators of causal relationships between variables as ‘labels’ and to exploit available data on the variables of interest to provide features for the labelling task. Background scientific knowledge or any available interventional data provide labels on some causal relationships and the remainder are treated as unlabelled. To illustrate the key ideas, we develop a distance-based approach (based on bivariate histograms) within a manifold regularization framework. We present empirical results on three different biological data sets (including examples where causal effects can be verified by experimental intervention) that together demonstrate the efficacy and generality of the approach, as well as its simplicity from a user’s point of view.

Suppose now that $\kappa$ is injective and $d_{\mathcal{P}}$ is a metric on $\mathcal{P}$. Then if $d_{\mathcal{S}}(S, S') = d_{\mathcal{P}}(\kappa(S), \kappa(S')) = 0$, it follows that $\kappa(S) = \kappa(S')$, which (by the injectivity of $\kappa$) implies $S = S'$ in $\mathcal{S}_n$. Thus, under these additional assumptions, $d_{\mathcal{S}}$ is a metric on $\mathcal{S}_n$.
Proof [Proposition 2] The non-negativity, symmetry and sub-additivity properties of $d_{\mathcal{P}}$ are clear, so all that remains is to establish that $d_{\mathcal{P}}(\pi, \pi') = 0$ implies $\pi = \pi'$. From the definition of $\mathcal{P}$, both $\pi$ and $\pi'$ are continuous on $\mathcal{Z}^2$. The result is then immediate: since $\pi$ and $\pi'$ are continuous and $\mathcal{Z}^2$ is compact, $\int_{\mathcal{Z}^2} |\pi(z) - \pi'(z)|^2 \, \mathrm{d}\Lambda^2(z) = 0$ implies that $\pi$ and $\pi'$ are identical as functions on $\mathcal{Z}^2$.
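The distance on samples obtained by comparing their estimated densities can be sketched numerically. The following is a minimal sketch, assuming that $\kappa$ is a histogram density estimator on $[0, 1]^2$ with square bins and that $d_{\mathcal{P}}$ is the $L^2$ distance between densities. The function names and the default bin count are illustrative choices and are not taken from the paper.

```python
import numpy as np

def histogram_density(sample, bins_per_axis):
    """2-D histogram density estimate on [0, 1]^2 (illustrative stand-in for kappa)."""
    sample = np.asarray(sample, dtype=float)
    counts, _, _ = np.histogram2d(
        sample[:, 0], sample[:, 1],
        bins=bins_per_axis, range=[[0.0, 1.0], [0.0, 1.0]],
    )
    h = 1.0 / bins_per_axis                    # side length of each square bin
    return counts / (len(sample) * h * h)      # piecewise-constant density

def histogram_distance(sample_a, sample_b, bins_per_axis=10):
    """L2 distance between the histogram density estimates of two samples."""
    h = 1.0 / bins_per_axis
    diff = (histogram_density(sample_a, bins_per_axis)
            - histogram_density(sample_b, bins_per_axis))
    # Both estimates are constant on each bin, so the integral is an exact sum.
    return float(np.sqrt(np.sum(diff ** 2) * h * h))

# Two distinct samples that fall in the same bins are at distance zero.
s1 = np.array([[0.01, 0.01], [0.55, 0.55]])
s2 = np.array([[0.04, 0.02], [0.58, 0.51]])
print(histogram_distance(s1, s2))
```

The last lines illustrate why injectivity of $\kappa$ is an additional assumption: a histogram maps distinct samples with identical bin counts to the same density, so without injectivity the induced distance on samples can vanish for $S \neq S'$.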
Proof [Proposition 4] This proof extends the simpler proof given for the univariate case in Theorem 6.11 of Wasserman (2006). For convenience, and without loss of generality, we suppose that $\mathcal{Z}^2 = [0, 1]^2$. In this section we re-assign the notation $z$ to denote a dummy variable in $\mathcal{Z}^2$ (instead of in $\mathcal{Z}^p$). Let [...] be the probability mass assigned to [...] so that, from binomial properties, the mean and variance of the histogram estimator $\kappa(S^{(n)})(z)$ at the point $z \in \mathcal{Z}^2$ are [...]. Let $b(z) = m(z) - \pi(z)$ denote the bias of the histogram estimator. The mean square of the error $\pi(z) - \kappa(S^{(n)})(z)$ at a point $z \in \mathcal{Z}^2$ admits a bias-variance decomposition: [...]. The aim is next to obtain separate bounds on the bias and variance terms.
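The moments referred to in this step follow a standard pattern for histogram estimators. As a sketch, assume the estimator is the usual bin-count histogram on square bins $B_{i,j}$ of side $h$, that is, $\kappa(S^{(n)})(z) = N_{i,j} / (n h^2)$ for $z \in B_{i,j}$, where $N_{i,j}$ is the number of sample points falling in $B_{i,j}$. The symbols $N_{i,j}$, $p_{i,j}$ and $v(z)$ are notation introduced here for illustration and may differ from the paper's own displays. Under that assumption, for $z \in B_{i,j}$:

```latex
\begin{align*}
p_{i,j} &= \int_{B_{i,j}} \pi(z') \,\mathrm{d}\Lambda^2(z'),
  \qquad N_{i,j} \sim \mathrm{Binomial}(n, p_{i,j}), \\
m(z) &= \mathbb{E}\big[\kappa(S^{(n)})(z)\big] = \frac{p_{i,j}}{h^2}, \\
v(z) &= \operatorname{Var}\big[\kappa(S^{(n)})(z)\big] = \frac{p_{i,j}\,(1 - p_{i,j})}{n h^4}, \\
\mathbb{E}\big[\big(\pi(z) - \kappa(S^{(n)})(z)\big)^2\big] &= b(z)^2 + v(z),
  \qquad b(z) = m(z) - \pi(z).
\end{align*}
```

The last line is the usual decomposition of a mean square error into squared bias plus variance, which is why the two terms can be bounded separately.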
To bound the bias term, Taylor's theorem gives that, for $z, z' \in B_{i,j}$, [...] where the remainder term satisfies [...]. Here $M_{\max} = \max\{M_{i,j}\}$ and $\nabla\nabla^{\top}\pi$ denotes the Hessian, which exists since $\pi$ is twice continuously differentiable in $\mathcal{Z}^2$. Thus for $z \in B_{i,j}$, integrating (7): [...] where the new remainder term can be bounded: [...] where the constant $C$ is independent of $z$ and $i, j$. The number 8 (which is not sharp) is obtained from a trivial but tedious computation of the integral in (8) and bounding each term in the result. Now, for $z \in B_{i,j}$, the bias is expressed using (7) as [...]. We then integrate this expression over $z \in B_{i,j}$: [...]. To bound these integrals we use Cauchy-Schwarz: [...] and [...]. Both expressions in (9) and (10) are finite since the integrand is continuous and the domain is compact. The total integrated bias is thus bounded as [...].

To bound the variance term, from the integral form of the mean value theorem we have that, for some $z_{i,j} \in B_{i,j}$, [...]. The application of the integral form of the mean value theorem is valid since $\pi$ is continuous on $\mathcal{Z}^2$. Then: [...].

Putting this all together, we obtain the bound [...] where $\mathbb{E}$ denotes expectation with respect to sampling of the data $S^{(n)} \sim \Pi$. From inspection of (11), the estimator error vanishes provided that $h$ is chosen such that $nh^2 \to \infty$. Since convergence in expectation implies convergence in probability, we have established that $\|\pi - \kappa(S^{(n)})\|_{L^2(\Lambda^2)} = o_P(1)$. The bandwidth $h^*$, which minimizes the upper bound in (11), is [...].

To this end, we must establish a context in which the data pairs $(x^{[k]}, y^{[k]})$ can be considered to be generated. Let $\rho_{\mathcal{X},\mathcal{Y}}$ be a probability distribution on $\mathcal{X} \times \mathcal{Y}$, with marginals $\rho_{\mathcal{X}}$, $\rho_{\mathcal{Y}}$ and conditional $\rho_{\mathcal{Y}|\mathcal{X}}$. In this theoretical investigation we suppose that all data are generated independently from $\rho_{\mathcal{X},\mathcal{Y}}$, with the values $\{y^{[k]} : [k] \in U\}$ being withheld.

Appendix B. Consistency of the Classifier
For a generic classifier $c : \mathcal{X} \to \{-1, +1\}$, define the misclassification rate [...]. This is minimized by $c_\rho(x) := \operatorname{sign}(f_\rho(x))$, where $f_\rho : \mathcal{X} \to \mathcal{Y}$ is the (typically unavailable) regression function [...]. Thus the quantity $R(c_\rho)$ captures the intrinsic difficulty of the classification task. A classifier $\hat{c}$ is said to be consistent if $R(\hat{c}) \to R(c_\rho)$ in the limit $m_L \to \infty$ of infinite labelled data (with convergence either in expectation, with high probability, etc.). Our consistency argument is based on the following straightforward bound:

Lemma 6 Fix $\epsilon > 0$ and let $\mathcal{X}_\epsilon := \{x \in \mathcal{X} : |f_\rho(x)| < \epsilon\}$. Then [...] where $\rho_{\mathcal{X}}(\mathcal{X}_\epsilon)$ denotes the $\rho_{\mathcal{X}}$-measure of the set $\mathcal{X}_\epsilon$.
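The quantities in this passage follow a standard pattern for binary classification with labels in $\{-1, +1\}$. As a sketch, and not necessarily the exact form used in the paper, the definitions and a two-term bound of the kind described can be written as below. Here $\hat{f}$ denotes a regression estimate with $\hat{c} = \operatorname{sign}(\hat{f})$, which is an assumption about how the classifier is built. The shape of the bound is chosen to agree with the minimization over $\epsilon$ carried out in the proof of Corollary 9.

```latex
\begin{align*}
R(c) &:= \rho_{\mathcal{X},\mathcal{Y}}\big(\{(x, y) : c(x) \neq y\}\big), \\
f_\rho(x) &:= \int_{\mathcal{Y}} y \,\mathrm{d}\rho_{\mathcal{Y}|\mathcal{X}}(y \mid x), \\
R(\hat{c}) - R(c_\rho)
  &= \int_{\mathcal{X}} |f_\rho(x)| \, \mathbf{1}\{\hat{c}(x) \neq c_\rho(x)\} \,\mathrm{d}\rho_{\mathcal{X}}(x) \\
  &\leq \rho_{\mathcal{X}}(\mathcal{X}_\epsilon)
     + \frac{1}{\epsilon} \, \big\|\hat{f} - f_\rho\big\|_{L^1(\rho_{\mathcal{X}})} .
\end{align*}
```

The inequality splits the integral over $\mathcal{X}_\epsilon$ and its complement. On $\mathcal{X}_\epsilon$ the integrand is at most 1, since $|f_\rho| \leq 1$. Off $\mathcal{X}_\epsilon$, a sign disagreement forces $|\hat{f}(x) - f_\rho(x)| \geq |f_\rho(x)| \geq \epsilon$, so the indicator is at most $|\hat{f}(x) - f_\rho(x)| / \epsilon$.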
Next we leverage an existing high-probability consistency result established in the regression (as opposed to classification) context:

Theorem 7 Suppose $f_\rho$ is non-constant and that $\Sigma_K^{-\alpha/2} f_\rho \in L^2(\rho_{\mathcal{X}})$ for some $\alpha \in (0, 1]$. Let $\theta = \frac{1}{(1+\alpha)(1+s)}$. Take $\lambda_1 = m_U^{\theta}$ and $\lambda_2 = m_L^{\theta}$. Then there exists a finite constant $C$ such that, for any $\delta \in (0, 1)$ and for $m_L, m_U$ sufficiently large, we have with probability at least [...]

Proof This result is an immediate consequence of Theorem 5.6 in Cao and Chen (2012), whose bound on the $L^2(\rho_{\mathcal{X}})$ error clearly also implies a bound on the $L^1(\rho_{\mathcal{X}})$ error. In addition, since our intention in what follows is limited to establishing consistency of the proposed classification method, as opposed to a detailed convergence rate analysis, we have simplified the presentation by stating a slightly weaker but less verbose upper bound.
Note that the "for $m_U$ sufficiently large" condition in Theorem 7 will typically be automatically satisfied in our context, where the amount of unlabelled data is $m_U = O(p^2)$. Thus the content of (14) is control over the error $\hat{f} - f_\rho$ as the number $m_L$ of labelled data is increased.
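For orientation, the kind of regression estimate that such a result concerns can be sketched in a few lines. This section does not spell out the estimator, so the following assumes the common Laplacian-regularized least squares form from the manifold regularization literature, with a Gaussian kernel and an unnormalized graph Laplacian built from the same kernel. The names `fit_laprls`, `lam_ambient` and `lam_manifold` are hypothetical, and no correspondence with $\lambda_1$, $\lambda_2$ in Theorem 7 is claimed.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def fit_laprls(X, y_labelled, lam_ambient, lam_manifold, bandwidth=1.0):
    """Laplacian-regularized least squares (assumed form, not the paper's code).

    The first len(y_labelled) rows of X are the labelled points. Minimizes
    mean squared error on labelled points + lam_ambient * RKHS norm^2
    + (lam_manifold / m^2) * f^T L f, and returns coefficients alpha with
    f_hat(x) = sum_i alpha_i * k(x_i, x).
    """
    m = X.shape[0]
    m_lab = len(y_labelled)
    K = rbf_kernel(X, X, bandwidth)
    W = K.copy()
    np.fill_diagonal(W, 0.0)                  # similarity graph without self-loops
    laplacian = np.diag(W.sum(axis=1)) - W    # unnormalized graph Laplacian
    J = np.zeros((m, m))
    J[:m_lab, :m_lab] = np.eye(m_lab)         # selects the labelled points
    y = np.zeros(m)
    y[:m_lab] = y_labelled                    # unlabelled targets are padded with 0
    system = (
        J @ K
        + lam_ambient * m_lab * np.eye(m)
        + (lam_manifold * m_lab / m ** 2) * (laplacian @ K)
    )
    return np.linalg.solve(system, y)

def predict(alpha, X_train, X_new, bandwidth=1.0):
    """Evaluate f_hat at new points; classify with np.sign of the result."""
    return rbf_kernel(X_new, X_train, bandwidth) @ alpha
```

In this sketch only the labelled block of the system grows with $m_L$, while the unlabelled points enter through the Laplacian term, which mirrors the roles that $m_L$ and $m_U$ play in the discussion above.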
Corollary 8 Under the same assumptions as Theorem 7, we have with probability at least $1 - 8\delta$ that [...]

Corollary 8 makes explicit how the intrinsic difficulty of the classification task depends on the form of $f_\rho$, and in particular on the extent to which $|f_\rho(x)| < \epsilon$ occurs in $\mathcal{X}$. For typical regression functions $f_\rho$ with simple roots in $\mathcal{X}$, it will hold that $\rho_{\mathcal{X}}(\mathcal{X}_\epsilon) = O(\epsilon)$. An assumption of this form can therefore be used to complete a high-probability consistency argument:

Corollary 9 (Consistency of the Classifier) Suppose that $\rho_{\mathcal{X}}(\mathcal{X}_\epsilon) = O(\epsilon^\gamma)$ for some $\gamma > 0$. Under the same assumptions as Theorem 7, there exists a finite constant $\tilde{C}$ such that, with probability at least $1 - 8\delta$, [...] In particular, this establishes that the classifier $\hat{c}$ is (with high probability) consistent.
Proof From the hypothesis, there exist $B_1, \epsilon_1$ such that $\rho_{\mathcal{X}}(\mathcal{X}_\epsilon) \leq B_1 \epsilon^\gamma$ for all $\epsilon < \epsilon_1$. Thus, for $\epsilon < \epsilon_1$ the difference $R(\hat{c}) - R(c_\rho)$ can be bounded via (15) as [...]. Differentiating $J$ and setting the derivative to zero reveals that $J$ is minimized over $(0, \infty)$ at $\epsilon^* = \left(\frac{B_2}{\gamma B_1}\right)^{\frac{1}{1+\gamma}}$, which satisfies $\epsilon^* < \epsilon_1$ for $m_L$ sufficiently large (recall that $m_L$ being sufficiently large was an assumption of Theorem 7). Thus, for $m_L$ sufficiently large, [...] which, upon substitution for $B_2$, yields the required result with the value for the constant [...]

Results for PC (which returns a point estimate) are shown as locations on the ROC plane. "TC" indicates use of a transitive closure operation and "cnstrnts" indicates that the background information Φ was included via input constraints. The "TC" results are included here for completeness, but we note that the reference graph here encodes direct, rather than ancestral, relationships. [Results shown are for significance level α = 0.01 and for a lenient interpretation where possible edges are included. Results are averages over 25 iterations.]
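The minimization step in the proof of Corollary 9 can be checked numerically. The sketch below assumes that $J$ has the two-term form $J(\epsilon) = B_1 \epsilon^\gamma + B_2 / \epsilon$, which is the form whose stationary point is $(B_2 / (\gamma B_1))^{1/(1+\gamma)}$. The constants are arbitrary illustrative values, and reading $B_2$ as a bound on the regression error is an assumption.

```python
import numpy as np

def J(eps, B1, B2, gamma):
    """Assumed two-term bound: small-margin mass plus scaled regression error."""
    return B1 * eps ** gamma + B2 / eps

def eps_star(B1, B2, gamma):
    """Stationary point of J on (0, inf): solves gamma*B1*eps^(gamma-1) = B2/eps^2."""
    return (B2 / (gamma * B1)) ** (1.0 / (1.0 + gamma))

# Illustrative constants only; compare the closed form with a grid search.
B1, B2, gamma = 2.0, 0.01, 1.0
grid = np.linspace(1e-4, 1.0, 200_000)
eps_grid = grid[np.argmin(J(grid, B1, B2, gamma))]
print(eps_star(B1, B2, gamma), eps_grid)
```

Substituting the minimizer back gives a value of $J$ proportional to $B_1^{1/(1+\gamma)} B_2^{\gamma/(1+\gamma)}$, so under this assumed form the bound shrinks to zero whenever $B_2$ does, with a rate governed by $\gamma$.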